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Abstract 

We have analysed an effect of the Bak-Sneppen predator-prey food- 
chain self-organization on nucleotide content of evolving species. In our 
model, genomes of the species under consideration have been represented 
by their nucleotide genomic fraction and we have applied two-parameter 
Kimura model of substitutions to include the changes of the fraction in 
time. The initial nucleotide fraction and substitution rates were decided 
with the help of random number generator. Deviation of the genomic 
nucleotide fraction from its equilibrium value was playing the role of the 
fitness parameter, B, in Bak-Sneppen model. Our finding is, that the 
higher is the value of the threshold fitness, during the evolution course, 
the more frequent are large fluctuations in number of species with strongly 
differentiated nucleotide content; and it is more often the case that the 
oldest species, which survive the food-chain competition, might have spe- 
cific nucleotide fraction making possible generating long genes. 
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1 Model introduction 

To understand the way the higher organized species emerge during evolution 
we consider very simple model of evolving food chain consisting of N species. 
In the model, each species is represented by nucleotide composition of their 
DNA sequence and the substitution rates between the nucleotides. There are 
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four possible nucleotides, A, T, G, C, in a DNA sequence. In our model, the 
DNA sequence is represented, simply, by four reals, Fa,Ft,Fc,Fc, being the 
nucleotide fractions and 

F A +F T + F G +Fc = l. (1) 

The nucleotide fractions depend on time due to mutations and selection. 

Our model is originating from the Bak-Sneppen model of co-evolution 
and Kimura's neutral mutation hypothesis (|2],[3j). According to Kimura's hy- 
pothesis, neutral mutations are responsible for molecular diversity of species. 
In 1980, Kimura introduced two-parameter model 0], [5], where the transi- 
tional substitution rate (substitutions A <-> G and C <-> T) is different from the 
transversional rate (substitutions A^T,G^T,A^C,G^C) . If we use 
Markov chain notation, with discrete time t, then the transition matrix, M nuc i, 
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representing rates of nucleotide substitutions in the two-parameter Kimura 
model fulfills the following equation 



i?(f + 1) = M nucl ^(t) 



(3) 



where F(t)~{FA(t), Fx{t), Fc(t), Fc(t)} T denotes nucleotide fractions at time 
t, u represents substitution rate and the symbols Wij = s for transitions and 
Wij — v for transversions (i,j = A,T,G,C) represent relative substitution 
probability of nucleotide j by nucleotide i. Wij satisfy the equation 

E W V = !> 

i,j=A,T,G,C 



(4) 



which in the case of the two-parameter Kimura model is converted into the 
following 



4s + 8u = 1, 



(5) 



Evolution described by Eq.ljSJ) has the property that starting from some 
initial value of i^(fo) a t t — to the solution tends to an equilibrium in which 
Fa = Ft — Fg — Fc = 0.25. The example of this type of behavior has been 
presented in Fig^ The two-parameter Kimura approximation is one of the 
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Figure 1: Dependence of nucleotide fractions on time in two-parameter Kimura 
model. Here, the initial fractions take the following values: Fa — 0.320964, 
F T = 0.246541, F G = 0.0252434, F c = 0.407252. Besides, there has been 
plotted the maximum absolute deviation from the difference \Fa — Fr\ and 
\Fg — Fc\ (the dashed curve). 

simplest models of nucleotide substitutions. For example, in reconstructing the 
phylogenetic trees, one should use a more general form of the transition matrix 
in Eq.® (p3,[E|)iE],!I|)- This is not necessary in our model, where we need only 
the property that the nucleotide frequencies are evolving to their equilibrium 
values. 

More complicated prey-predator relations were simulated with a 5 x 5 Chowd- 
hury lattice with a fixed number of six food levels. Each lower (prey) level 
contains twice as many possible species as the upper (predator) level. Also this 
model does not contain an explicit bit-string as genome. We now introduced a 
composition vector F(t) as above, different for each different species, and let it 
evolve according to Eq.(3). Again, after many iterations all four fractions ap- 
proached 0.25. This result, as we will show below, is qualitatively different from 
that in the model defined below, where we observe fluctuations of nucleotide 
frequency, instead. 

Our model consists of N species and for each species we define the set of 
random parameters, Fa, Ft, Fq, Fc, u, s, v, which satisfy only two equations, 
Eq.QJ and Eq.JHJ), and we assume that 4s > 8v to fulfill the condition that 
transitions (s) dominate transversions (v). The nucleotide fractions, represent- 
ing each species, change in time according to Eq.©. 

The species are related according to food-chain. In the case of the nearest- 
neighbor relation the species i + 1 preys on species i. The extension to further 
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neighbors follows the same manner. The food-chain has the same dynamics as 
in Bak-Sneppen model (BS) 1 , i.e., every discrete time step t, we choose the 
species i with minimum fitness Bi and the chosen species is replaced by a new 
one together with the species linked to it with respect to food-chain relation. 
In the original BS model the nearest neighborhood of species i is symmetrical, 
e.g. {£?i-i, Bi, Bi+i}. The asymmetrical (directional) neighborhood applied 
for food-chain modeling has been discussed by Stauffer and Jan |5] and their 
results were qualitatively the same as in the BS model. The generalizations of 
food-chain onto more complex food- web structures are also available ^0]> P"T| - 
The new species, substituting the old ones, obtain new random values Fa, 
Ft, Fa, Fc, u, s, v. In our model the fitness Bi of the species i = 1, 2, . . . , N 
is represented by the parameter 

Bi=l-D, D = m& X (\F A -F T \,\F G -Fc\), (6) 

where Bi G [0, 1] is a measure of the deviation from equilibrium of the nucleotide 
numbers Fa — Ft and Fq — Fc- Thus, the species with the smallest value of 
Bi (largest compositional deviation from equilibrium) are eliminated together 
with their right-hand neighbors with respect to food-chain. This elimination 
mechanism leads to self-organization. Namely, in the case of finite value of N 
the statistically stationary distribution of the values of Bi (i = 1,2, ... , N) is 
achieved after finite number of time steps with the property that the selected 
species with the minimum value B m i n is always below some threshold value 
B c or it is equal to the value. The typical snapshot, at transient time, of the 
distribution of the values of Bi is presented in Fig|3 

So, if FigI3looks much the same as it had been resulted from the simulation 
of pure BS model, then what are the new results in our model? In the following, 
we will show that the higher value of the threshold fitness, during the evolution 
course, it is often the case that the winners of the food-chain competition become 
also species with specific nucleotide composition, which is generating long genes. 

2 Discussion of results 

We know, from Eq.Q (see also Fig|TJ, that a single species tends to posses 
equilibrium nucleotide composition, which in this simple two-parameter Kimura 
model means asymptotically the same nucleotide composition Fa = Ft = Fg = 
Fc = 0.25. The only distinction, which we could observe, if we had used a 
more general form of the substitution table, could be the resulting equilibrium 
nucleotide composition different from the uniform one. This would bring nothing 
new to the qualitative behavior of our model. 

Once, in the model under consideration, nucleotide composition of species is 
changing according to Eq.(J3J), the species fitness Bi depends on time. It is not 
the case in the BS model £Q, where the fitness of the evolving species is constant 
in time unless it is extincted. Although Bi depends on time, the food-chain 
selection rule introduces mechanism, which forbids to achieve the equilibrium 
nucleotide composition (Bi = 1). Instead, there appears a threshold value of 
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Figure 2: Snapshot of the distribution of the species fitness at the transient time 
t = 5000. In the example, N — 200 and the substitution rate is a random real 
u = 0.01 * md, number of the nearest-neighbors n = 1. The horizontal line is 
representing the value of threshold fitness. 
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Figure 3: Few examples of time dependence of the threshold fitness B for dif- 
ferent values of the upper bound of the applied substitution rate u. 
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Figure 4: Distribution P(k) of gene length k in Chromosome IV of Saccha- 
romyces cerevisiae genome and in the case of the approximate analytic formula 
(Eq.JJ|)), where the nucleotide fractions take the values as in Chr. IV, i.e., 
F A = 0.31121, F T = 0.309727, F G = 0.190188, F c = 0.188875. Parameter k is 
representing number of codons (nucleotide triplets). 

B c , below which the species become extinct. In our model the threshold value 
depends on substitution rate u. The examples of this dependence for transient 
time of 10 9 generations have been plotted in FigEl 

Similarly, as in BS model, the SOC phenomenon disappears if the number 
of nearest neighbors n = 0. Then, all species tend to the state with B = 1. 

We will discuss the influence of threshold fitness optimization on nucleotide 
composition of species and, in consequence, its influence on the possible max- 
imum length of gene in species genome. To this aim, we assume that a gene 
has continuous structure (no introns) and it always starts from codon START 
(ATG) and ends with codon STOP (TGA, TAG or TAA). Then, the probability 
of generating any gene consisting of k nucleotide triplets in a random genome 
with the fractions Fa, Ft, Fq, Fc could be approximated by the following 
formulae (see also |13|): 



P(k) = aF A F T F G {2F A F T F G + F%F T )(1 - 2F A F T F G - FIFt)^ 1 , (7) 



where a is a normalization constant, which can be derived from the normaliza- 
tion condition 




(8) 
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The value of k cu t s in Eq. (JSJ could be associated with genome size. 

In Fig0| there has been shown the relation between the empirical distribu- 
tion of gene length k in chromosome IV of Saccharomyces cerevisiae genome 
and the distribution P(k) in Eq.Q). Similar results we could obtain for other 
genomes. One can observe, that the approximation in Eq.Q is acceptable for 
small gene size, whereas it becomes wrong for large gene size. Generally, it is 
accepted that there is direct selective pressure on gene size for the effect. Ex- 
amples of papers discussing the problem could be found together 
with analyses of rich experimental data. 

The lowest frequency of gene size, fc, in Saccharomyces cerevisiae genome is 
equal to Pq ~ 2.8 x 10~ 5 (Fig^J. In many natural genomes Pq takes value of 
the same order of magnitude, e.g., in the B. burgdorferi genome Pq ~ 5.7 x 1CP 5 . 
In our model, we have assumed that for all species holds Pq — 1 x 10~ 6 . We 
have also introduced maximum gene length, fc max , which is the largest value of 
k for which P(k) > P . 

In the particular case of the same fractions of nucleotides in genome (Fa = 
Ft = Fq = Fc — 0.25) the limiting value k = fc max for which P(k) > Pq is 
equal to k max = 225 nucleotide triplets (675 nucleotides). Thus, in our model, 
we could expect that for the oldest species the maximum gene length fc max 
should not exceed the value of 675 nucleotides. The reason for that is that ageing 
species should approach equilibrium composition (Fig^l. However, surprisingly, 
we found that the self-organization phenomenon enforces a state, in which the 
oldest species may have much longer gene sizes than in genome with nucleotide 
composition corresponding to equilibrium composition. Actually, there start to 
appear fluctuations in the number of species with very short genes and very 
long ones. 

The selection towards the species with the smallest deviation from equilib- 
rium nucleotide composition (the largest value of B) implicates that the species, 
which survive the selection, may have specific bias in nucleotide composition, 
which makes possible generating long genes. In our model, we have observed 
abundances of G+C content in the species with long genes. During simulation 
run, in each time step i, we have collected in a file data representing age of the 
oldest species, the corresponding gene size k and nucleotide frequency. We have 
observed, that the closer the species fitness Bi is to the threshold fitness the 
older might be the species and also the species might posses longer genes in its 
genome. There is no such effect in the case, when n = 0. Even if there could 
appear, at some early time interval, a tendency to generate longer genes, this 
property would have disappeared after longer evolution time of the system of N 
species. In Fig|5| we have plotted time dependence of the recorded maximum 
length of gene in the oldest species and the corresponding Guanine fraction. One 
can compare this figure with FigEl where there are no prey-predator relation 
in the ecosystem (n = 0) . In the latter case, the system is ageing in accordance 
with the Eq.@ and Bi — > 1 (i = 1, 2, . . . , N) and self-organization has not been 
observed. 

The observed property of the competing species has an analogy with the 
behavior of the model of evolution of evolving legal rules in Anglo-American 
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Figure 5: Time dependence of maximum gene size, k max , of the oldest species 
and the Guanine content in their genome in the evolving ecosystem when N — 
500, u — 0.1 * rnd, n = 1. The data in the figure have been decimated. 
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Figure 6: The same parameters as in Fig0but n = 0. 
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Figure 7: Maximum gene length, fc max , versus genomic fraction of nucleotide in 
the oldest species. The vertical lines correspond to equilibrium genome. In our 
model this means Fa — Ft = Fq = Fc = 0.25. The same parameters as in 
Fig'Elhave been used. 

court, introduced by Yee |15| (see Fig. 3 in his paper). 

The relation between nucleotide fraction of genome and the possible maxi- 
mum length of gene in such genome has been shown in the histogram in Fig[7| 
The presented data address, solely, the oldest species. Notice, A « T and 
G « C for genomes both with short genes and long ones, whereas A w T w G « 
C « 0.25 for genomes with nucleotide composition near equilibrium understood 
in terms of the two-parameter Kimura model. We should remember, that the 
substitution table for the two-parameter Kimura model (Eq|3J) is a symmet- 
ric matrix and the observed compositional asymmetry results directly from the 
predator-prey self-organization phenomenon. The right-hand wings, evident in 
the structure in Fig0 do vanish in the case when n — in spite of the same 
fitness parameter in EqH3 applied for selection. 

We have not included strand structure in species genome, in our model, 
since it is represented only with the help of nucleotide fraction. Lobry and 
Sueoka, in their paper |16j . concluded that if there are no bias in mutation and 
selection between two DNA strands, then it is expected A « T and G « C 
within each strand, and that the observed variation in G+C content between 
species is an effect of another phenomenon than, simply, asymmetric mutation 
pressure. Here, we have shown, that such compositional fluctuations of genome 
could result from ecosystem optimization - no direct selection on genes length 
is present in our model. 

The predator-prey rule, in the model under consideration, introduces large 
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fluctuations in nucleotide frequency in the ageing ecosystem, if it is sufficiently 
old. However, we have not observed this frequency phenomenon in modeling 
speciation on the Chowdhury lattice as we stated in the beginning. Af- 
ter we have introduced a small change of our model, in such a way, that new 
species arising in the speciation process were taken always from among the sur- 
vived species, and we only slightly were modifying their nucleotide frequency by 
introducing d% of changes in their values, then the observed by us fluctuations 
ceased to exist in the limit d — > 0, as found in the Chowdhury model. 

3 Conclusions 

The specific result of the food-chain self-organization of the competing species is 
that the oldest survivors of the competition might posses strong compositional 
bias in nucleotides, the abundance of G+C content. In our model, this resulting 
asymmetry makes possible generating long genes. There was no direct selection 
applied on the gene length, in the model. The fluctuation in number of species 
with long genes and short genes represents rather undirectional noise, the am- 
plitude of which is increasing while the ecosystem is ageing. The effect ceases to 
exist if there is no species competition. The same is if we allow only d% changes 
of nucleotide frequency in the new formed species, in the limit d — > 0. 

It could be, that the observed self-organization is an attribute of genes in 
genome evolution. Typically, many genes are coupled together in genome in a 
hierarchical dynamical structure, which resembles complex food-web structure. 
Some genes may be duplicated but also you can observe fusion of genes or even 
genomes. 
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